PR to fetch Updates from the sim main (don't merge) #529

Open
SaitejaSankoji wants to merge 30 commits into arenadeveloper02:main from simstudioai:main

Conversation

@SaitejaSankoji
Collaborator

Summary

Brief description of what this PR does and why.

Fixes #(issue)

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Documentation
  • Other: ___________

Testing

How has this been tested? What should reviewers focus on?

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

Screenshots/Videos

TheodoreSpeaks and others added 30 commits June 3, 2026 03:08
…rigger faults (#4860)

* fix(webhook): don't fault trigger run on user/workflow execution errors

Webhook-triggered executions re-threw every error, so trigger.dev marked
the run failed and fired #eng-errors alerts. The vast majority of these are
user-caused workflow failures (missing required fields, invalid field
references, bad URLs, provider 4xx, expired models, low credit) that are
already recorded in the execution logs.

Distinguish fault vs error in executeWebhookJobInternal: when the failure
was finalized by core (the workflow ran and its failure is logged), complete
the run with { success: false } instead of throwing. Errors that were not
finalized came from the webhook pipeline itself and still re-throw to fault
the run. Await waitForPostExecution first so the finalized flag is reliable.

The error is still recorded on the run's OTel span via recordException (no
ERROR status, so the run isn't faulted) and remains in the execution logs,
so these stay investigable in Tempo/Loki without false alerts.
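
The fault-versus-error split described above could be sketched roughly as follows. This is only an illustration: `settleWebhookFailure` and `WebhookRunResult` are hypothetical names, the real `executeWebhookJobInternal` is not shown in this PR text, and the awaiting of `waitForPostExecution` and the span recording are left out.

```typescript
// Hypothetical result shape, not the project's actual type.
interface WebhookRunResult {
  success: boolean
  error?: string
}

/**
 * Decide how a failed webhook execution should end the run.
 * - finalized === true: core ran the workflow and logged the failure, so the
 *   run completes with { success: false } and is not faulted.
 * - finalized === false: the webhook pipeline itself failed, so the error is
 *   re-thrown and the run is faulted.
 */
function settleWebhookFailure(error: unknown, finalized: boolean): WebhookRunResult {
  if (!finalized) {
    throw error
  }
  const message = error instanceof Error ? error.message : String(error)
  return { success: false, error: message }
}
```

In the real task the `finalized` flag would presumably only be read after `waitForPostExecution` has settled, as the commit message notes.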

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(schedule): don't fault trigger run on error-recovery failures

The schedule task already treats workflow-execution failures as recorded
errors rather than trigger faults, but the outermost catch's own recovery
code (the infra-retry and releaseClaim calls) was unguarded. A secondary DB
blip while releasing the claim re-threw and escaped run(), faulting the
trigger.dev run and firing an alert — a double-fault during cleanup.

Wrap the recovery path in a try/catch: log and record the exception on the
span without re-throwing. The claim expires on its TTL and the next tick
re-claims the schedule, so swallowing the cleanup failure is safe.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(webhook): assert waitForPostExecution runs on the non-finalized path

Guards the race fix on the infra-error path so a future refactor can't
silently drop the await. Addresses Greptile review feedback.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…le storage (#4865)

* feat(storage): support S3-compatible endpoints (R2, MinIO, B2) for file storage

Add S3_ENDPOINT and S3_FORCE_PATH_STYLE env vars, wired into the shared upload
S3 client so Cloudflare R2, MinIO, Backblaze B2, and other S3-compatible stores
work for self-hosted file storage. The endpoint is trusted operator config (no
SSRF/HTTPS gate). Makes the multipart Location fallback endpoint-aware, extends
the S3 client unit tests, and documents the new vars in Helm values, .env.example,
and the English self-hosting docs (incl. browser-reachability + CORS guidance).

* docs(storage): add RustFS as an S3-compatible provider example

* fix(storage): address review feedback and fix env mock for CI

- Add envBoolean to the shared env test mock (createEnvMock) so config.ts's
  forcePathStyle coercion resolves — fixes failing knowledge/utils.test.ts
- Declare S3_FORCE_PATH_STYLE as z.string() (every other env var's pattern);
  it's coerced via envBoolean at the consumption site, avoiding a boolean
  type that never matches the string process.env value
- Log path-style from S3_CONFIG.forcePathStyle (envBoolean) instead of a
  separate isTruthy call, so the startup log can't disagree with the client
- Make buildObjectFallbackUrl honor forcePathStyle: virtual-hosted-style URL
  (bucket as subdomain) for R2, path-style only when forcePathStyle is set

* docs(storage): add backlinks to S3-compatible providers (R2, MinIO, Ceph, B2, RustFS) and backends
* fix(auth): link SSO sign-in to existing same-email accounts

SSO sign-ins failed with "account not linked" (then a cascading "Invalid
callbackURL") when an account with the same email already existed. Better
Auth's `@better-auth/sso` plugin hardcodes the provisioned user's
`emailVerified: options?.trustEmailVerified ? <claim> : false`, so with the
option unset every SSO login arrived unverified and tripped the account
linking gate `(!isTrustedProvider && !userInfo.emailVerified)` whenever the
provider was not in `accountLinking.trustedProviders`.

- Set `trustEmailVerified: true` on the SSO plugin so the IdP's verified-email
  claim is honored (Okta, Entra ID, Google Workspace, Auth0 all assert it).
- Trust the operator's configured provider for linking: merge
  `SSO_PROVIDER_ID` (when present in the app env) plus a new
  `SSO_TRUSTED_PROVIDER_IDS` list into `trustedProviders`. Empty/unset =>
  no-op, so existing deployments are unchanged.
- Invite callback URL: return a clean `/invite/<id>` (token already persists
  in sessionStorage) so an appended `?error=` cannot produce a malformed URL.
- Document `SSO_TRUSTED_PROVIDER_IDS` in SSO docs, Helm values, and schema.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(auth): address review — guard trusted SSO providers, revert invite callback

- Only compute additionalTrustedSsoProviders when SSO_ENABLED, so
  trustedProviders is exactly unchanged for non-SSO deployments.
- Revert the invite getCallbackUrl change: keep the token in the callback URL
  (with sessionStorage/searchParams fallback) so the token survives when
  sessionStorage is unavailable. The account-linking fix removes the
  "account not linked" error that caused the malformed callback URL, so the
  callback cleanup is unnecessary.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(auth): guard trusted SSO providers with isSsoEnabled (isTruthy)

env.SSO_ENABLED can be the string "false" (t3-env returns strings for
booleans), which is truthy in JS. Use the canonical isSsoEnabled flag
(isTruthy(env.SSO_ENABLED)) so SSO_ENABLED="false"/"0" correctly yields an
empty trusted-provider list, matching how SSO is gated elsewhere.
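
A small sketch of why the string `"false"` slips through a plain truthiness check. The `isTruthy` below is a hypothetical re-implementation for illustration; the project's own helper may accept other spellings.

```typescript
// Hypothetical stand-in for the project's isTruthy helper.
function isTruthy(value: string | boolean | undefined): boolean {
  if (typeof value === "boolean") return value
  if (value === undefined) return false
  const normalized = value.trim().toLowerCase()
  return normalized === "true" || normalized === "1"
}

// Env vars arrive as strings, so "false" is a non-empty (truthy) string.
const rawFlag: string = "false"
const naiveCheck = Boolean(rawFlag) // true: any non-empty string is truthy
const gatedCheck = isTruthy(rawFlag) // false: the string is interpreted
```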

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
* feat(gitlab): sync repository files (code/docs) alongside wiki and issues

* fix(gitlab): follow full keyset next-link for repo tree + skip disabled wiki gracefully in all/both

* fix(gitlab): error on bad user branch (tree 404), warn on resolveRef fallback, normalize pathPrefix to directory boundary

* fix(gitlab): preserve slashes in branch ref for file source URLs (GitFlow branches)

* fix(gitlab): never abort sync on repo-tree 404 (empty repo); validate user branch exists at setup instead

* fix(gitlab): validate ref via commits endpoint so tags and commit SHAs are accepted, not just branches

* fix(gitlab): skip repo phase on tree 403 (missing read_repository) so wiki/issues still sync under all

* fix(byok): add Fal icon and repair corrupted Ollama icon path

The Ollama BYOK icon rendered blank because its SVG path had spaces
stripped between arc-command flags (e.g. `a5.05 5.05 0 12.05-.636`),
producing invalid tokens. Replaced with the canonical Ollama path.

Also added a dedicated FalIcon (was falling back to the generic
ImageIcon) and wired it into the BYOK provider list.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(icons): repair corrupted Fireworks icon arc command

The leftmost spark of the Fireworks icon never rendered because its
third subpath used a corrupted arc command (`a34.59 34.59 0 17.15 37.65`)
with collapsed flags, yielding an invalid sweep-flag of 7 that aborts
the path parse. Replaced with the canonical lobehub Fireworks source.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
…less execution (#4870)

* fix(mothership): run client-routed workflow tools server-side in headless execution

Headless Mothership (Mothership block, no browser) could not run workflows.
The run_workflow/run_workflow_until_block/run_block/run_from_block tools are
registered with route 'client', so the executor gate (isSimExecuted) skipped
their registered server handlers and fell through to executeAppTool, throwing
'Tool not found'. Interactive runs delegate these to the browser before reaching
the executor, so only the headless path broke.

Allow a client-routed tool to use its registered server handler when one exists,
which only affects the four run tools (the only client-routed tools, all of which
have server handlers).

* test(mothership): clear handler registry between executor tests

Add clearHandlers() helper and reset the module-level handler registry in
beforeEach so handlers registered in one test do not leak into the next.
…aks (#4869)

* fix(dev): use globalThis for singleton state to prevent HMR memory leaks

* fix(dev): apply globalThis guard to rate-limiter storage factory to prevent listener accumulation

* fix(types): resolve McpConnectionManager globalThis undefined type error
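
The globalThis-singleton pattern these commits refer to might look roughly like the sketch below. `ConnectionManager` and the `__connectionManager` key are hypothetical names chosen for the example, not the project's identifiers.

```typescript
// Hypothetical manager class, for illustration only.
class ConnectionManager {
  readonly createdAt = Date.now()
}

// Widen globalThis with an optional slot so the lookup type-checks
// (the slot is undefined until the first call).
const globalStore = globalThis as typeof globalThis & {
  __connectionManager?: ConnectionManager
}

function getConnectionManager(): ConnectionManager {
  // Module-level variables are reset whenever HMR re-evaluates the module,
  // but globalThis survives, so the existing instance is reused.
  if (!globalStore.__connectionManager) {
    globalStore.__connectionManager = new ConnectionManager()
  }
  return globalStore.__connectionManager
}
```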
…sSameOrigin (#4873)

* fix(gitlab): pin pagination cursor to configured host before following it

The repository-tree keyset cursor stores GitLab's verbatim rel="next"
URL and re-fetches it with an Authorization: Bearer header. Assert the
cursor's origin matches the configured apiBase before following it, so a
tampered or corrupted fileNextUrl cannot exfiltrate the access token to
an attacker-controlled host. Fails closed on mismatch.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* improvement(validation): generalize isSameOrigin and reuse across connectors/tools

Add an optional base argument to the shared isSameOrigin (defaulting to
the app base URL) so callers can pin a URL to any trusted origin. The
GitLab connector's cursor host-check and the tools self-origin check now
consume the shared helper instead of their own URL-parsing.
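
A minimal sketch of an origin check with an optional base, assuming a hypothetical `APP_BASE_URL` default and a hypothetical `assertCursorOnHost` wrapper; the shared helper's real signature may differ.

```typescript
// Hypothetical default; the real helper defaults to the app base URL.
const APP_BASE_URL = "https://app.example.com"

function isSameOrigin(url: string, base: string = APP_BASE_URL): boolean {
  try {
    return new URL(url).origin === new URL(base).origin
  } catch {
    // Unparseable input fails closed.
    return false
  }
}

// Refuse to follow a stored pagination cursor that points at another host,
// so the Authorization header is never sent to it.
function assertCursorOnHost(cursorUrl: string, apiBase: string): string {
  if (!isSameOrigin(cursorUrl, apiBase)) {
    throw new Error("Pagination cursor origin does not match the configured host")
  }
  return cursorUrl
}
```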

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
)

* fix(storage): percent-encode object key in multipart fallback URL

buildObjectFallbackUrl built the object URL from a raw key. Keys with spaces
or reserved characters (and the pre-existing AWS branch) would produce a
structurally invalid location. Encode the key per path segment (preserving
'/' separators) across all branches (AWS, custom path-style, virtual-hosted).
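
The three URL shapes and the per-segment encoding could be sketched as below. The options shape and field names are assumptions made for the example, and any path component on a custom endpoint is ignored here; the project's `buildObjectFallbackUrl` may handle more cases.

```typescript
// Hypothetical options shape, not the project's config type.
interface FallbackUrlOptions {
  bucket: string
  key: string
  region: string
  endpoint?: string // custom S3-compatible endpoint (R2, MinIO, ...)
  forcePathStyle?: boolean
}

// Encode each path segment on its own so the '/' separators survive.
function encodeKey(key: string): string {
  return key
    .split("/")
    .map((segment) => encodeURIComponent(segment))
    .join("/")
}

function buildObjectFallbackUrl(opts: FallbackUrlOptions): string {
  const key = encodeKey(opts.key)
  if (!opts.endpoint) {
    // AWS: virtual-hosted-style URL.
    return `https://${opts.bucket}.s3.${opts.region}.amazonaws.com/${key}`
  }
  const endpoint = new URL(opts.endpoint)
  if (opts.forcePathStyle) {
    // Path-style: the bucket is the first path segment (e.g. MinIO).
    return `${endpoint.origin}/${opts.bucket}/${key}`
  }
  // Virtual-hosted-style on a custom endpoint: bucket as subdomain (e.g. R2).
  return `${endpoint.protocol}//${opts.bucket}.${endpoint.host}/${key}`
}
```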

* refactor(storage): clearer per-segment key encoding in fallback URL

* test(storage): cover multipart fallback URL (AWS, R2 virtual-hosted, MinIO path-style, key encoding)
…agnostics) (#4868)

* fix(tables): retry transient DB/Redis failures in cell execution and surface error causes

Workflow-group-cell runs intermittently failed on trivial DB reads/writes
under heavy fan-out, stranding cells in `running`. Investigation showed the
PlanetScale and ElastiCache backends were healthy at the time — the failures
are transient connection-level faults that the cell (maxAttempts: 1) had no
tolerance for, and the real cause was never logged (Drizzle wraps it as
"Failed query: ..." and the driver cause lives in error.cause).

Resilience:
- Add retryTransient (lib/table/retry-transient.ts): retries only transient
  infra errors (reuses isRetryableInfrastructureError; adds an ioredis
  command-timeout match) with jittered backoff, then rethrows. Fail-fast for
  everything else.
- Wrap the cell's getTableById/getRowById reads, the terminal write
  (cell-write updateRow — idempotent via the executionId guard), and the
  Redis cascade-lock acquire.

Diagnostics:
- Add describeError (lib/core/errors/retryable-infrastructure.ts): walks the
  .cause chain and always returns the underlying driver cause (code/errno/
  syscall + causeChain), including for unclassified errors like AbortError.
- Log `cause` + a `retryable` flag (and aborted/timedOut in the cell's main
  catch) across the cell + finalization error paths, mirroring the existing
  schedule-execution pattern. Logging-only; no behavior change. This lets the
  next recurrence reveal the real cause and whether the retry applies.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(tables): address review feedback on cell retry resilience

- retryTransient: re-check the abort signal after the backoff sleep so a
  cancellation during sleep stops the next attempt (don't run/return work for
  an already-cancelled task).
- isRetryableRedisError: walk the .cause chain (mirroring the infra
  classifier) so wrapped Redis timeouts are recognized; drop "Connection is in
  subscriber mode" — that's a connection-state programming error, not a
  transient drop, and would just fail identically every retry.
- cascade-lock: stop wrapping acquireLock in retryTransient. acquireLock is a
  non-idempotent SET NX, so retrying after a timed-out-but-applied first SET
  returns false (key already ours) and yields a false `contended` that skips
  the cascade. A transient Redis blip here just fails the run before pickup
  (no stranded cell); the dispatcher re-drives it.
- Tests: cause-chain Redis match, subscriber-mode exclusion, abort-during-sleep.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(tables): drop out-of-scope abort/timeout fields from cell catch

The main catch logged `aborted`/`timedOut` from `abortSignal`/`timeoutController`,
but those are declared inside the outer try block (the inner try around
executeWorkflow is try/finally, so this catch belongs to the outer try) and are
not in scope in the catch — `next build`'s type-check failed with "Cannot find
name 'abortSignal'". Local incremental `tsc --noEmit` had skipped the file and
falsely passed; the Cursor/Greptile reviewers flagged this correctly.

Removed the two fields. Abort/timeout is still surfaced via `cause:
describeError(err)` (an aborted run shows `name: 'AbortError'` / the timeout
message), so no diagnostic signal is lost.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(tables): drop in-process retry, keep cause diagnostics only

In-process retry is the wrong layer for this path: the cell task is
maxAttempts:1 by design, retrying on a possibly-degraded worker may not help,
and it masks the very transient-failure signal we're trying to capture before
we understand the root cause. Removed retryTransient entirely (file + all
wrapping in cell-write, the cascade reads, and the lock acquire) and kept only
the diagnostic logging.

- Deleted lib/table/retry-transient.ts (+ test); cell-write and the cascade
  reads call getTableById/getRowById/updateRow directly again, fail-fast.
- Kept describeError + `cause`/`retryable` fields across the cell + finalization
  catch blocks; the cell-path `retryable` flag now sources from
  isRetryableInfrastructureError (the canonical classifier) for consistency.

Diagnostics-first: surface the real driver cause on the next recurrence, then
decide the actual fix (e.g. task-level maxAttempts, or addressing the worker-
side cause) from evidence rather than a speculative in-process retry.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(schedules): log error cause on scheduled-execution failure paths

The scheduled-job failure paths logged the raw error (.message/stack only) —
its `.cause` (the real driver error behind a Drizzle "Failed query: ..."
wrapper) was never recorded, and the classified-only
`describeRetryableInfrastructureError` returns undefined for unrecognized
errors. A real failed run (same incident window as the cell failures) failed in
`applyScheduleUpdate` with exactly this unrecorded cause.

Added `cause: describeError(error)` (always-on, walks the cause chain) to the
applyScheduleUpdate catch, the early-failure catch, and the unhandled-error
catch — passed as a second arg so the existing message+stack still emit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(errors): move describeError to @sim/utils/errors

`describeError` is a general-purpose error/cause-chain helper — it didn't
belong in `lib/core/errors/retryable-infrastructure.ts` (that module is
specifically about classifying retryable infra errors, and the name read wrong
for a generic diagnostic). Moved it to `@sim/utils/errors` alongside `toError`/
`getErrorMessage`/`getPostgresErrorCode`, with its own cycle-safe cause walk.

- Added describeError + DescribedError + tests to packages/utils/src/errors.ts.
- Reverted the describeError addition from retryable-infrastructure.ts (it keeps
  only isRetryableInfrastructureError / describeRetryableInfrastructureError,
  which are accurately named and still used by the schedule retry path).
- Re-pointed all consumers (cell, logging-session, pause-persistence, schedule)
  to import describeError from @sim/utils/errors. The `retryable` classification
  flag still sources from isRetryableInfrastructureError where used.
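
A cycle-safe walk of the `.cause` chain might look like the following sketch. The `DescribedError` fields here are guesses based on the commit text (code plus a cause chain); the real type in `@sim/utils/errors` is not shown in this PR and may carry more fields such as errno or syscall.

```typescript
// Hypothetical output shape for illustration.
interface DescribedError {
  name?: string
  message: string
  code?: string
  causeChain: string[]
}

function describeError(error: unknown): DescribedError {
  const seen = new Set<unknown>()
  const causeChain: string[] = []
  let errorName: string | undefined
  let code: string | undefined
  let current: unknown = error

  // Follow .cause until it ends or loops back on an already-visited value.
  while (current !== undefined && current !== null && !seen.has(current)) {
    seen.add(current)
    const record = current as { name?: unknown; message?: unknown; code?: unknown; cause?: unknown }
    causeChain.push(typeof record.message === "string" ? record.message : String(current))
    // Later (deeper) entries overwrite earlier ones, so the driver cause wins.
    if (typeof record.name === "string") errorName = record.name
    if (typeof record.code === "string") code = record.code
    current = record.cause
  }

  return {
    name: errorName,
    message: causeChain[causeChain.length - 1] ?? String(error),
    code,
    causeChain,
  }
}
```

With a wrapper such as a "Failed query: ..." error whose `cause` is the driver error, the returned `message` and `code` come from the driver error rather than the wrapper.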

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…dgebase connector, SSO provider ID allowlist, singleton memory leak fix
)

* feat(tables): background import for large CSVs with live progress

* fix(tables): address review — import heartbeat, overlap guard, column/empty validation

* fix(tables): guard sync import overlap, scope fileKey to workspace, delete-on-replace after download

* fix(tables): stream large CSV imports from storage instead of buffering the whole file

* test(tables): fix async-import route tests for workspace-scoped fileKey + name uniquification

* fix(tables): append imports start after existing rows; reconcile missed import failures in the tray

* fix(tables): delete the uploaded CSV from storage after the import finishes

* fix(tables): validate replace before deleting rows; ignore stale replayed import events by importId

* fix(tables): bind import worker to its importId (no stale-worker clobber/overlap) and destroy storage stream on failure

* feat(tables): byte-based import progress, cancel support, and a start toast that opens the import view

* fix(tables): don't emit ready after cancel; honor cancel during the upload phase

* improvement(tables): use a stop (square) icon for canceling an active import

* fix(tables): make markTableImporting an atomic claim to close the concurrent-import TOCTOU race

* improvement(tables): preview CSV import from a slice, drop client row-count warning

The import dialog parsed the entire file in the browser to show an exact row
count and a row-limit warning. That holds the whole file in memory, blocks the
main thread, and hits V8's ~512MB string ceiling — so the dialog capped the
effective import size well below what the streaming importer handles.

Parse only the first 512KB (headers + sample for the mapping); drop the exact
count and the "would exceed the row limit by N" gate. The DB row-count trigger
already enforces max_rows server-side, so an over-limit import fails fast during
the run with a clear message instead of being blocked by an expensive parse.
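
The slice-based preview could be approximated as below. This is a sketch under simplifying assumptions: it operates on bytes already in memory (a browser would typically take `file.slice(0, PREVIEW_BYTES)` first), `previewLines` is a hypothetical name, and the naive line split does not handle quoted newlines the way a real CSV parser would.

```typescript
const PREVIEW_BYTES = 512 * 1024

// Decode at most maxBytes and drop the trailing partial row, if any.
function previewLines(bytes: Uint8Array, maxBytes: number = PREVIEW_BYTES): string[] {
  const truncated = bytes.length > maxBytes
  const text = new TextDecoder().decode(bytes.subarray(0, maxBytes))
  const lines = text.split(/\r?\n/)
  // When the slice cut the file short, the last line may be incomplete.
  if (truncated) lines.pop()
  return lines.filter((line) => line.length > 0)
}
```

The first returned line would serve as the header row for column mapping and the rest as sample rows; no total row count is available from a slice, which matches the commit's decision to drop the client-side count.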

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(tables): gate import ownership every batch and stop canceled imports reappearing

- Worker checked run ownership only at the progress cadence (~every 5k rows), so
  a canceled/superseded import could insert several more batches (incl. the final
  partial batch) before stopping. Move the updateImportProgress ownership gate to
  the top of every flush — a run that lost the table stops within one batch.
- A list/dialog import canceled mid-upload left the server row `importing` until
  the in-flight server cancel landed; hydration re-seeded it from useTablesList,
  so the dismissed import flickered back. Flag the real table id canceled on the
  mid-upload cancel path, skip re-seeding flagged tables in hydration, and clear
  the flag once the server import is terminal.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(tables): drive import tray by polling derived from server, not SSE

Import progress no longer holds an SSE connection per importing table. The tray
now derives its importing rows live from the table list (React Query), polled
only while an import is in flight; the table detail page keeps its own
cell-state SSE for grid refresh.

- store holds only client-only state now: optimistic uploads, which terminal
  completions to surface this session, canceled ids, menu open — no copied
  importStatus/rowsProcessed.
- useWorkspaceImports is the single source: polls via a data-predicate
  refetchInterval, derives rows, and fires completion toasts on the
  importing -> terminal transition.
- kickoff handlers use startUpload/setUploadPercent/endUpload; the invalidated
  list refetch surfaces the server row and polling takes over.
- removes use-hydrate-import-tray + use-import-progress-tracker (folded in).
- trims over-verbose comments across the import paths.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(tables): ignore superseded-run import events in the detail SSE cache

applyImport applied every replayed import payload to the detail cache. The SSE
buffer can replay a prior import's terminal event for the same table, stomping a
newer in-flight import's UI. Lock to the active run's importId (and ignore a
replayed terminal before the id is known), matching the guard the header tracker
used to have.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(tables): close sync-import TOCTOU by claiming the atomic import gate

The sync import route checked importStatus from a checkAccess snapshot, then
parsed/validated/wrote seconds later without taking the atomic claim. A
concurrent async kickoff (markTableImporting) could slip into that window and
both writers would run together — for replace mode, two delete+insert passes
leave the table indeterminate.

Claim the same atomic gate (markTableImporting) right before the write and
release it in the finally (before the response returns, so a client refetch
never sees the transient status). A row-level FOR UPDATE was avoided on purpose:
it would invert lock order against the position advisory lock / row-count
trigger and risk a deadlock — markTableImporting is the established gate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(multipart): keep abort wired after resolve so a mid-upload disconnect tears down the stream

readMultipart resolves on the file-part header and hands the caller an un-drained
stream, but settle() ran cleanup() and detached the abort listener on that path
too. A client disconnect mid-upload then destroyed nothing — busboy never saw EOF,
the file stream stalled, and the route's `for await` held a request slot until
maxDuration (300s). Re-arm an abort handler scoped to the file stream on resolve,
detached when the stream closes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ntState verification (#4877)

* fix(chat): prevent XSS in attachment preview via filename/data URL escaping

Replace document.write with an escaped blob URL preview: HTML-entity
encode the user-controlled filename and data URL, open with
noopener,noreferrer, and revoke the blob URL after navigation.
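
A minimal sketch of the escaping step, assuming hypothetical helper names (`escapeHtml`, `buildPreviewHtml`) and a simplified preview document; the actual markup used by the chat preview is not shown in this PR.

```typescript
// Characters that must be entity-encoded before interpolation into HTML.
const HTML_ESCAPES: Record<string, string> = {
  "&": "&amp;",
  "<": "&lt;",
  ">": "&gt;",
  '"': "&quot;",
  "'": "&#39;",
}

function escapeHtml(value: string): string {
  return value.replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch] ?? ch)
}

// Build the preview document from user-controlled values, escaping both.
function buildPreviewHtml(filename: string, dataUrl: string): string {
  const safeName = escapeHtml(filename)
  const safeSrc = escapeHtml(dataUrl)
  return `<!doctype html><title>${safeName}</title><img src="${safeSrc}" alt="${safeName}">`
}
```

In the browser, the resulting string would then be wrapped in a Blob, opened through an object URL with `noopener,noreferrer`, and the URL revoked afterwards, as the commit describes.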

* fix(mcp): guard OAuth discovery and token revocation against SSRF

Route discoverOAuthServerInfo and the RFC 7009 revocation POST through an
SSRF-guarded fetch that validates every request URL via validateMcpServerSsrf
(blocking private/reserved/loopback targets, honoring ALLOWED_MCP_DOMAINS and
self-hosted localhost rules) and pins the connection to the resolved IP to
prevent DNS-rebinding TOCTOU. Previously these fetches used unvalidated global
fetch against URLs taken verbatim from attacker-controllable
authorization-server metadata.

* fix(webhooks): verify Graph clientState on Teams chat-subscription notifications

The microsoftteams_chat_subscription trigger set clientState=webhook.id when
creating the Graph subscription but never validated it on inbound change
notifications, so any request to the webhook path with a crafted notification
body was treated as authentic (CWE-345). verifyAuth now requires every
notification in the value array to carry a clientState matching the stored
webhook id (constant-time compare) and rejects payloads without notifications.
Validation handshakes (validationToken) are handled before auth and remain
unaffected; outgoing-webhook HMAC auth is unchanged.

* fix(webhooks): fail closed when Teams chat-subscription webhook id is unavailable

Hardens the clientState check so a missing webhook id (theoretically
unreachable, since the row is looked up by primary key) can never collapse
the expected value to an empty string that a forged clientState could match.
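
The clientState check with its fail-closed behaviour might be sketched as follows. The notification shape and function names are assumptions for the example, and the hand-rolled comparison stands in for whatever constant-time primitive the project uses (note that it still returns early on a length mismatch).

```typescript
// Hypothetical notification shape; only clientState matters for this check.
interface GraphNotification {
  clientState?: string
}

// Compare two strings without exiting early on the first differing character.
function timingSafeEqualStrings(a: string, b: string): boolean {
  if (a.length !== b.length) return false
  let diff = 0
  for (let i = 0; i < a.length; i++) {
    diff |= a.charCodeAt(i) ^ b.charCodeAt(i)
  }
  return diff === 0
}

function verifyClientState(
  webhookId: string | undefined,
  body: { value?: GraphNotification[] }
): boolean {
  // Fail closed: without a stored id there is nothing to compare against,
  // so an empty clientState can never match.
  if (!webhookId) return false
  const notifications = body.value
  // Payloads without notifications are rejected.
  if (!Array.isArray(notifications) || notifications.length === 0) return false
  return notifications.every(
    (n) => typeof n.clientState === "string" && timingSafeEqualStrings(n.clientState, webhookId)
  )
}
```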

* docs(mcp): note AbortSignal does not bound SSRF-guard DNS lookup

* improvement(chat): hoist HTML escape map to module-level constant
* fix(mcp): enforce tool name validation in deploy modal

* fix(mcp): correct cn import path to fix build

* fix(mcp): align tool-name regex with server sanitization, add disabled-combobox hint
… .agents/skills and expand add-model touchpoints (#4882)

* chore(skills): mirror model/enrichment/hosted-key/council skills into .agents/skills and expand add-model touchpoints

* chore(skills): document council yaml omission and disambiguate validate-model cross-ref
…ool routes (#4884)

* fix(polling-tools): pass plan execution timeout to internal polling tool routes

* address comments
Reads and writes are fully cut over to the normalized copilot_messages table
(verified in production: no writes to the column in 24h, recently-active chats
have empty JSONB while copilot_messages holds the transcript). Drop the dead
column via drizzle migration 0225 and re-type CopilotChatDetailRow.messages as
an assembled (non-column) field.

Deploy notes: reconcile any chats where the JSONB still leads copilot_messages
before applying, and pg_repack copilot_chats afterward to reclaim the ~5.7GB
TOAST storage (DROP COLUMN is metadata-only).
…form, Azure DevOps, YouTube, JSM, S3, Sentry) (#4880)

* feat(connectors): add 7 knowledge base connectors (Google Forms, Typeform, Azure DevOps, YouTube, JSM, S3, Sentry)

* fix(connectors): tighten listingCapped semantics per review (WIQL cap, batch omissions, cap-vs-exhaustion)

* fix(connectors): google-forms listingCapped must fire on slice regardless of hitLimit (404-null-filter gap)

* fix(connectors): s3 streaming size cap for chunked responses without content-length

* fix(connectors): ado byte-exact file content fetch, google-forms hash-poisoning on listing failure

* fix(connectors): ado auth-failure deletion guard, jsm last-page slice flag, google-forms response cap in hash

* fix(connectors): shared streaming size-cap reader for ado file hydration (promote from s3)

* fix(knowledge): flag incomplete listings at engine level when pagination is truncated

* fix(connectors): ado flags listing incomplete when a non-empty repo has no resolvable branch

* fix(knowledge): engine truncation flag is an absolute deletion block (fullSync cannot override); s3 byte-exact size fallback; ado tsdoc accuracy

* improvement(knowledge): extract shouldReconcileDeletions gate as tested pure function, tighten engine comments

* test(connectors): mapTags coverage for the 7 new connectors

* fix(connectors): ado probes past the wiql 20k cap before flagging; document custom-wiql full-listing behavior

* fix(connectors): ado flags partial repo trees when items listing emits a continuation token

* fix(connectors): ado discards foreign-phase cursors; google-forms scans all response pages for change detection

* fix(connectors): audit fixes across new connectors

- registry: register x connector (was dead code, never wired in)
- google-docs/google-drive/google-forms: gate deletion reconciliation on
  Drive incompleteSearch; google-docs also now sets listingCapped on its
  maxDocs cap path
- jsm: add read:jira-user scope so reporter resolves on requests
- gong: only set listingCapped on genuine truncation, not exact-cap
  source exhaustion
- gitlab: issues phase switched to keyset pagination (removes ~50k
  offset ceiling), matching the repo-tree phase
- grain: parallelize recording + transcript fetch in getDocument
- ashby: document updatedAt-based content-hash limitation for
  notes/feedback change detection
- tests: mapTags coverage for x, granola, greenhouse, fathom, rootly
…d tools (#4883)

* feat(integrations): add ClickHouse block and expand Dagster + Tinybird tools

* fix(tinybird): fail loudly on invalid query_pipe parameters JSON

parsePipeParameters previously returned {} on any JSON parse error, so a
mistyped 'parameters' input produced a successful pipe call with the dynamic
filters silently dropped. Throw a clear error for non-empty, non-object input
instead; an omitted/empty value still means 'no parameters'.
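
The behaviour described above can be illustrated with a short sketch. The error wording is made up for the example and the real `parsePipeParameters` may differ in details.

```typescript
function parsePipeParameters(input: string | undefined): Record<string, unknown> {
  // An omitted or empty value still means "no parameters".
  if (input === undefined || input.trim() === "") return {}
  let parsed: unknown
  try {
    parsed = JSON.parse(input)
  } catch {
    throw new Error("Invalid parameters: expected a JSON object")
  }
  // Valid JSON that is not a plain object (array, number, null) is rejected too.
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    throw new Error("Invalid parameters: expected a JSON object")
  }
  return parsed as Record<string, unknown>
}
```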

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(dagster): guard NaN numeric coercions and bound list_assets pagination

Address PR review:
- Route all block numeric coercions (list_runs limit/createdAfter/createdBefore,
  get_run_logs logsLimit, list_assets assetsLimit) through a toFiniteNumber()
  guard so invalid/wand-generated text becomes undefined instead of NaN.
- list_assets now applies a default page size (100) when no limit is given, so
  paging stays bounded and hasMore is meaningful even when limit is omitted.
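
A guard of this kind could be as small as the sketch below; this version is an assumption about the behaviour (strings and numbers only), not the project's actual `toFiniteNumber`.

```typescript
// Coerce block input to a number, returning undefined instead of NaN/Infinity.
function toFiniteNumber(value: unknown): number | undefined {
  if (typeof value === "number") return Number.isFinite(value) ? value : undefined
  // Blank strings would coerce to 0 with Number(), so treat them as absent.
  if (typeof value !== "string" || value.trim() === "") return undefined
  const parsed = Number(value)
  return Number.isFinite(parsed) ? parsed : undefined
}
```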

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(dagster): make list_assets hasMore exact via fetch N+1

Address PR review (hasMore true on exact page): request one extra row
(pageSize + 1), use its presence as the authoritative hasMore, slice it off,
and derive the returned cursor from the last RETURNED asset's key path
(JSON-serialized; Dagster normalizes JS/Python whitespace on the way in).
This removes the false-positive hasMore when the final page is exactly full.
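
The fetch N+1 technique can be sketched generically as below. `fetchRows` and `cursorOf` are hypothetical callbacks and the function is synchronous for brevity; the real tools call Dagster's API asynchronously and serialize the cursor differently.

```typescript
interface Page<T> {
  items: T[]
  hasMore: boolean
  cursor?: string
}

// fetchRows is a hypothetical source returning up to `limit` rows after `cursor`.
function paginate<T>(
  fetchRows: (limit: number, cursor?: string) => T[],
  cursorOf: (item: T) => string,
  pageSize: number,
  cursor?: string
): Page<T> {
  // Ask for one extra row; its presence proves another page exists.
  const rows = fetchRows(pageSize + 1, cursor)
  const hasMore = rows.length > pageSize
  // Slice the probe row off so callers only see pageSize items.
  const items = hasMore ? rows.slice(0, pageSize) : rows
  // Derive the cursor from the last RETURNED item, not the probe row.
  const last = items[items.length - 1]
  return {
    items,
    hasMore,
    cursor: hasMore && last !== undefined ? cursorOf(last) : undefined,
  }
}
```

When the final page holds exactly `pageSize` rows, the probe row is absent, so `hasMore` is false and no cursor is returned.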

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(clickhouse): enforce read-only query operation and harden WHERE-clause guard

* fix(dagster): make list_runs hasMore exact via fetch N+1

Address PR review (list runs false hasMore): request one extra row
(pageSize + 1), use its presence as the authoritative hasMore, and slice it
off before mapping. Removes the false-positive hasMore (and misleading cursor)
when the final page is exactly `limit` runs long. Mirrors the list_assets fix.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(clickhouse): restrict DROP PARTITION to literal values to prevent SQL injection

* fix(clickhouse): reject chained statements in read-only query operation

* fix(clickhouse): force JSON output on query path and ignore comments when detecting chained statements

* fix(tinybird): encode datasource/pipe names in URL paths to prevent traversal

A user-or-llm datasource/pipe name interpolated raw into the URL path (e.g.
'real_ds/../../other') is normalized by the WHATWG URL parser and can target a
different endpoint. Wrap the path segment with encodeURIComponent in the
truncate, delete, and query_pipe URLs. Events/append pass the name via
URLSearchParams, which already encodes, so they were unaffected.
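
The normalization problem and the fix can be reproduced with the standard `URL` class. The base URL and path layout below are placeholders, not Tinybird's actual endpoints.

```typescript
// Hypothetical base URL and path layout, for illustration only.
const API_BASE = "https://api.example.com"

// Raw interpolation: the URL parser resolves "../" segments in the name.
function datasourceUrlUnsafe(name: string): string {
  return new URL(`/v0/datasources/${name}/truncate`, API_BASE).toString()
}

// Encoded interpolation: "/" becomes %2F, so the name stays one path segment.
function datasourceUrl(name: string): string {
  return new URL(`/v0/datasources/${encodeURIComponent(name)}/truncate`, API_BASE).toString()
}

const hostileName = "real_ds/../../other"
const unsafeResult = datasourceUrlUnsafe(hostileName)
const safeResult = datasourceUrl(hostileName)
```

With the raw name the parser collapses the path to `/v0/other/truncate`, while the encoded variant keeps it under `/v0/datasources/`.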

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(clickhouse): block WITH-led writes/DDL in read-only query operation

* fix(clickhouse): validate column types structurally and normalize FORMAT around SETTINGS

* fix(clickhouse): balance-check ORDER BY/PARTITION BY and skip leading comments in read-only guard

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
* fix(autolayout): relocate notes that overlap blocks after layout

* fix(autolayout): harden note overlap resolution against resize and non-finite positions
* feat(metrics): emit hosted-key metrics to Grafana via OTel

Replace the dropped platform.hosted_key.* spans with OTel counters/histograms for usage, cost, failures, throttles, and queue waits. Wire a MeterProvider into the Next.js OTel SDK (trigger.dev already exports metrics). Per-key attribution via a key label (env var name).

* fix(metrics): correct hosted-key failure attribution

- Re-point used/cost/failed labels at the freshly acquired key after reacquire
- Classify quota-style 401/403 as rate_limited (mirror isRateLimitError)
- Count returned success:false runs (e.g. deep_research polling) as failed

* fix(metrics): label hosted_key.throttled with real provider on exhausted retries

* fix(metrics): parse OTLP metrics URL via URL/pathname, not string suffix

Handles query strings and trailing slashes so the /v1/traces->/v1/metrics
swap can't produce a malformed endpoint, matching normalizeOtlpTracesUrl.
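
One way to do the swap on the parsed pathname, as a sketch. `tracesToMetricsUrl` is a hypothetical name, and leaving non-matching URLs unchanged is an assumption; the PR's helper may behave differently.

```typescript
// Hypothetical sketch: derive an OTLP metrics URL from a traces URL.
function tracesToMetricsUrl(tracesUrl: string): string {
  const url = new URL(tracesUrl);
  // Drop trailing slashes before matching, so "/v1/traces/" also matches.
  const path = url.pathname.replace(/\/+$/, "");
  const suffix = "/v1/traces";
  if (path.endsWith(suffix)) {
    // Only the pathname is rewritten; the query string is preserved as-is.
    url.pathname = path.slice(0, -suffix.length) + "/v1/metrics";
  }
  return url.toString();
}

console.log(tracesToMetricsUrl("https://otel.example.com/v1/traces?token=abc"));
// prints "https://otel.example.com/v1/metrics?token=abc"
```

A plain string-suffix check would miss the first example because the string ends in the query, not in `/v1/traces`.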
- Route both row-GET endpoints (internal + v1) and the copilot tool through
  the single service.queryRows instead of three inline query copies; add a
  withExecutions option so the public v1 route still omits executions.
- Run COUNT(*) and the page fetch concurrently in queryRows.
- Move CSV-import transaction ownership out of the API route into
  importAppendRows / importReplaceRows so routes never hold a trx.
- Extract row position mechanics (reserve / shift / compact) into named
  private helpers in service.ts; no separate table-wrapper module.
…d/no-output badges (#4889)

* feat(tables): workflow version selection (live/deployed) and not-found/no-output badges

* fix(tables): draw row-selection left edge as checkbox cell border so it cannot be cut off

* fix(tables): per-group version in cascade, accurate deploy error, skip not-found for deployed groups

* fix(tables): render selection left edge as continuous strip overlapping row gridlines

* feat(tables): not-found column icon, optional workflow inputs, mothership deploymentMode

---------

Co-authored-by: waleed <walif6@gmail.com>
All app replicas shared a hardcoded service.instance.id ("mothership-sim"),
so OTel metrics from every process collapsed into one Prometheus series.
Their independent cumulative counters then interleaved, producing phantom
counter resets that corrupt rate()/increase() — staging hosted-key cost
inflated to ~$0.72 from a few cents, while no-`key` metrics (cost_charged,
throttled, queue_wait_*) were affected fleet-wide.

Append the hostname (the container id under ECS, unique per task) so each
replica gets its own series and sum(rate(...)) / sum(increase(...)) aggregate
correctly. The mothership-sim prefix is kept so Jaeger's clock-skew adjuster
still separates Sim from Go.
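
To illustrate why a shared series corrupts the numbers, here is a simplified model of counter-reset handling: any drop is treated as a reset and the post-drop value is added. It ignores extrapolation and is not Prometheus's actual implementation; `increase`, `buildServiceInstanceId`, and the `-` separator are hypothetical.

```typescript
// Simplified increase(): sum positive deltas; on a drop, assume a reset to 0.
function increase(samples: number[]): number {
  let total = 0;
  let prev: number | undefined;
  for (const curr of samples) {
    if (prev !== undefined) {
      total += curr >= prev ? curr - prev : curr;
    }
    prev = curr;
  }
  return total;
}

// Two replicas, each with its own cumulative counter.
const replicaA = [10, 11, 12];
const replicaB = [2, 3, 4];
// Same instance id: samples interleave into a single series.
const collapsed = [10, 2, 11, 3, 12, 4];

console.log(increase(replicaA) + increase(replicaB)); // prints 4
console.log(increase(collapsed)); // prints 27

// Hypothetical id builder: keep the prefix, append the per-replica hostname.
function buildServiceInstanceId(prefix: string, hostname: string): string {
  const host = hostname.trim();
  return host ? `${prefix}-${host}` : prefix;
}
```

In a Node process the hostname argument would typically come from `os.hostname()`.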

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…lag-gated, default off) (#4890)

* feat(tables): add order_key column, fractional-indexing util, and ordering flag (off)

* feat(tables): write order_key on insert, flag-gate delete reindex + query ordering, add backfill

Flag off (default) = identical behavior. Single-insert assigns a fractional
order_key; queryRows orders by order_key when the flag is on; deletes skip the
O(N) reindex when on. Per-table-atomic backfill script populates existing rows.

* feat(tables): write order_key on all insert paths (batch, upsert, replace, import, create, copilot)

Completes the always-write-keys prerequisite: every row insert now assigns a
fractional order_key consistent with position order, so the flag can be flipped
safely after backfill. Flag off (default) still = identical behavior.

* feat(tables): insert-by-neighbor-id + orderKey on wire + client order-by-key

Inserts express intent as afterRowId/beforeRowId (O(1) key mint via the
(table_id,order_key,id) index); orderKey is returned on every row; client
reconcile/undo place by orderKey (no neighbor bump) with position fallback.
Flag off = unchanged. 205 table tests pass.
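
For readers unfamiliar with fractional indexing, the sketch below mints a key that sorts strictly between two neighbors using a common digit-midpoint scheme. It is not the util added in this PR: `keyBetween` and the base-10 alphabet are assumptions chosen for readability.

```typescript
// Keys are digit strings read as fractions: "" means 0 and null means 1.
const DIGITS = "0123456789";
const ZERO = DIGITS.charAt(0);

function keyBetween(a: string, b: string | null): string {
  if (b !== null && a >= b) throw new Error("a must sort before b");
  if (a.endsWith(ZERO) || (b !== null && b.endsWith(ZERO))) {
    throw new Error("keys must not end in the zero digit");
  }
  if (b !== null) {
    // Skip the common prefix, padding a with zeros as needed.
    let n = 0;
    while ((a.charAt(n) || ZERO) === b.charAt(n)) n++;
    if (n > 0) return b.slice(0, n) + keyBetween(a.slice(n), b.slice(n));
  }
  // The first digits now differ (or b is open-ended).
  const digitA = a ? DIGITS.indexOf(a.charAt(0)) : 0;
  const digitB = b !== null ? DIGITS.indexOf(b.charAt(0)) : DIGITS.length;
  if (digitB - digitA > 1) {
    // Room for a single digit halfway between the two.
    return DIGITS.charAt(Math.round((digitA + digitB) / 2));
  }
  // Consecutive digits: either shorten b or extend a by one more digit.
  if (b !== null && b.length > 1) return b.slice(0, 1);
  return DIGITS.charAt(digitA) + keyBetween(a.slice(1), null);
}

const firstKey = keyBetween("", null); // "5"
const appended = keyBetween(firstKey, null); // "8"
const between = keyBetween(firstKey, appended); // "7"
```

Because only the new row's key is written, an insert between two neighbors does not shift any other row.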

* feat(tables): resolve position-based inserts by key ordinal under the flag

Position-based callers (mothership tool, v1 API, undo fallback, transient old
clients) resolve their insert neighbor by order_key ordinal (OFFSET) when the
flag is on — positions are gappy then, so WHERE position=N would miss. Flag off
keeps the indexed position lookup. The mothership tool itself is unchanged.

* test(tables): flag-on coverage — delete skips reindex, insert mints key + no shift

* fix lint

* chore(db): regenerate order_key migration with default drizzle name

* fix(tables): address review — guard neighbor insert + mutual-exclusion + safe reconcile

- resolveInsertByNeighbor throws when the anchor row is missing (was silently
  inserting at the front) and when its order_key is null under the flag.
- insert contract: afterRowId/beforeRowId are mutually exclusive (refine).
- reconcileCreatedRow only key-sorts when every cached row is keyed, so mid-
  backfill un-keyed rows aren't yanked to the front.
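
The last point can be sketched as below. The `Row` shape and `sortRows` are hypothetical, and the tie-break on `id` is an assumption that mirrors the (table_id, order_key, id) index mentioned earlier.

```typescript
interface Row {
  id: string;
  position: number;
  orderKey: string | null;
}

// Sort by orderKey only when every row has one; otherwise fall back to position.
function sortRows(rows: Row[]): Row[] {
  const allKeyed = rows.every((row) => row.orderKey !== null);
  const copy = [...rows];
  if (allKeyed) {
    copy.sort((a, b) => {
      const ka = a.orderKey ?? "";
      const kb = b.orderKey ?? "";
      if (ka !== kb) return ka < kb ? -1 : 1;
      // Equal keys: break the tie on id so the order is deterministic.
      return a.id < b.id ? -1 : a.id > b.id ? 1 : 0;
    });
  } else {
    // Mid-backfill: some rows are un-keyed, so keys cannot be trusted yet.
    copy.sort((a, b) => a.position - b.position);
  }
  return copy;
}
```

Mixing keyed and un-keyed rows in a key sort would push every null key to one end, which is the "yanked to the front" behavior the fix avoids.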

* fix(kb): restore non-null guard in storage-key filter (unsafe-lint regression)

* refactor(tables): extract maxOrderKey + thread import append key

- Extract maxOrderKey(executor, tableId) helper; replaces three identical
  max(order_key) selects (single/batch insert append + import).
- Import: read the append anchor once up front and thread each batch's last
  key forward (nextImportStartOrderKey + afterOrderKey) instead of re-scanning
  max(order_key) per batch over a growing table — one scan per import, not
  one per 1k-row batch.

* fix(tables): keep insert body base omittable for v1 contract

The afterRowId/beforeRowId mutual-exclusion .refine() turned the schema into a
ZodEffects, which Zod forbids .omit() on — v1's insertTableRowBodySchema.omit({
position }) threw at module load (runtime-only; tsc misses it). Split the plain
object base out, apply the shared refine on top, and have v1 omit from the base
then re-apply it.

* fix(tables): chunk backfill order-key writes

A single UPDATE … FROM (VALUES …) over a whole large table overflows the JS call
stack while drizzle assembles the VALUES list (and would blow past Postgres's
65535 bound-param limit at ~32k rows) — large tables failed with 'Maximum call
stack size exceeded'. Write in 1000-row chunks inside the same per-table
transaction so keying stays atomic.
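
A sketch of the chunking arithmetic. `chunk` and `maxRowsPerStatement` are hypothetical helpers, and two bound parameters per row is inferred from the "~32k rows" figure above rather than stated in the commit.

```typescript
// Postgres bound-parameter ceiling cited in the commit message.
const PG_MAX_BOUND_PARAMS = 65535;

// Split a list into consecutive chunks of at most `size` items.
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError("size must be a positive integer");
  }
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Largest row count one statement can carry for a given params-per-row.
function maxRowsPerStatement(paramsPerRow: number): number {
  return Math.floor(PG_MAX_BOUND_PARAMS / paramsPerRow);
}

console.log(maxRowsPerStatement(2)); // prints 32767
console.log(chunk(Array.from({ length: 2500 }, (_, i) => i), 1000).map((c) => c.length));
// prints [ 1000, 1000, 500 ]
```

Running every chunk inside the same transaction, as the commit describes, keeps the per-table write atomic even though it is split across statements.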

* fix(tables): emit orderKey in insert responses

The single-row and batch insert handlers dropped orderKey from the JSON
response even though the service returns it, so reconcileCreatedRow always fell
back to position-sorting and could place neighbor inserts wrong under the
fractional-ordering flag. Serialize orderKey alongside position.

* fix(tables): restore by orderKey, not position, under fractional flag

A saved position is the gappy column value, but under the flag, insert reads
position as a visual rank (OFFSET), so position-based restore misplaces rows.

- create-row redo now goes through the batch path carrying the saved orderKey
  (the single-insert API has no orderKey field); drop the now-unused single
  create mutation.
- resolveBatchInsertOrderKeys appends under the flag instead of feeding gappy
  positions to resolveInsertOrderKey; positions remain the flag-off path.

* perf(tables): backfill writes 5000 rows/chunk (was 1000)

5x fewer round-trips per table; ~10k bound params stays well under Postgres's
65535 ceiling and far below the single-statement size that overflows the stack.

* fix(tables): drop rowNumber from table trigger payload

position is gappy under the fractional-ordering flag, so rowNumber (= row.position)
no longer reflects a contiguous visual rank. Rather than compute-on-read, remove
it from the trigger payload, output schema, and column-execution input.

Also pin isTablesFractionalOrderingEnabled=false in update-row.test.ts so its
flag-off position-shift assertions are deterministic regardless of local env.

* chore(db): format generated 0226 migration metadata

biome check . flagged the drizzle-generated _journal.json and 0226_snapshot.json;
apply the formatter so packages/db lint:check passes in CI.

* docs(triggers): drop rowNumber from table trigger outputs

rowNumber was removed from the table trigger payload; remove it from the
documented output fields to match.

* test(tables): remove flag-on fractional-ordering unit suite

Flag-on behavior is covered by manual large-table verification; the heavily-
mocked DB-chain suite added little signal.
…ERE-clause validation (#4895)

* fix(clickhouse): centralize WHERE-clause validation in input-validation and harden tautology detection

* fix(clickhouse): enforce server-side readonly=1 on the query operation

* fix(clickhouse): allow BETWEEN bounds in WHERE validation (OR-only literal rule) and dedupe JSDoc

4 participants